Preserve newlines in Table and TableChunk elements during PDF partitioning #4214

eureka928 · 2026-01-27T18:10:05Z

Summary

This PR fixes an issue where newline characters were being incorrectly stripped from Table and TableChunk elements during PDF partitioning. The RE_MULTISPACE_INCLUDING_NEWLINES regex was being applied indiscriminately to all Text elements, including tables, which removed newlines that carry structural meaning (such as row separation).

Changes

unstructured/partition/pdf.py: Added conditional logic to skip whitespace normalization for Table and TableChunk elements, preserving newlines that convey tabular structure
CHANGELOG.md: Added entry documenting the fix
unstructured/__version__.py: Version bump to 0.18.33

Problem

When processing PDFs (especially image-based PDFs with tables), the code applied this regex substitution to all Text elements:

el.text = re.sub(RE_MULTISPACE_INCLUDING_NEWLINES, " ", el.text or "").strip()

This stripped meaningful line breaks from table content, degrading the structural representation of tabular data.
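To make the effect concrete, here is a minimal sketch. The definition of RE_MULTISPACE_INCLUDING_NEWLINES is not shown in this description, so `ASSUMED_PATTERN` below is a stand-in that assumes the constant matches any run of whitespace, newlines included. The sample table text is invented for illustration.

```python
import re

# Stand-in for RE_MULTISPACE_INCLUDING_NEWLINES (assumption: it matches any
# run of whitespace, including newlines). The real definition is not shown here.
ASSUMED_PATTERN = re.compile(r"\s+")

# Illustrative table text: one row per line, columns padded with spaces.
table_text = "Name  Qty\nApple  3\nPear   5"

# Same substitution as the line quoted above, applied to the sample text.
flattened = re.sub(ASSUMED_PATTERN, " ", table_text or "").strip()

print(flattened)  # Name Qty Apple 3 Pear 5
```

Under that assumption, the three rows collapse into a single line, so row boundaries can no longer be recovered from the text.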

Solution

Added a check to exclude Table and TableChunk elements from the whitespace normalization:

# Skip newline normalization for Table/TableChunk - newlines carry structural meaning
if not isinstance(el, (Table, TableChunk)):
    el.text = re.sub(
        RE_MULTISPACE_INCLUDING_NEWLINES,
        " ",
        el.text or "",
    ).strip()

badGarnet · 2026-01-28T01:42:10Z

The ingest test is failing because runs of multiple whitespace characters used to be replaced with a single space, but now they are left as-is, which changes the extracted text.

The ticket only asked for newlines to be preserved, and that seems reasonable for tables.
But runs of whitespace that exclude newlines (for example, two spaces together) should still be replaced with a single space to improve the readability of the extracted content.
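One possible way to meet both requirements, sketched under assumptions: collapse only whitespace that is not a newline, so row breaks survive while padding is normalized. The helper name `collapse_spaces_keep_newlines` and the pattern `HORIZONTAL_WS` are hypothetical and are not taken from the commits in this PR.

```python
import re

# Hypothetical pattern: runs of whitespace other than the newline character.
# [^\S\n] reads as "not non-whitespace, and not \n", i.e. spaces, tabs, etc.
HORIZONTAL_WS = re.compile(r"[^\S\n]+")


def collapse_spaces_keep_newlines(text: str) -> str:
    """Collapse non-newline whitespace runs to one space, keeping row breaks."""
    collapsed = HORIZONTAL_WS.sub(" ", text or "")
    # Trim each line so no row starts or ends with a stray space.
    return "\n".join(line.strip() for line in collapsed.split("\n")).strip()


sample = "Name  Qty\nApple \t 3 \n Pear   5"
print(repr(collapse_spaces_keep_newlines(sample)))
# 'Name Qty\nApple 3\nPear 5'
```

In this sketch, non-table elements would keep going through the existing newline-collapsing substitution, and only Table and TableChunk text would be routed to the helper.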

…rtitioning The RE_MULTISPACE_INCLUDING_NEWLINES regex was being applied to all Text elements, including Table and TableChunk. This incorrectly removed newline characters that carry structural meaning in tables (row separation). Fixes Unstructured-IO#3983

Co-Authored-By: Claude Opus 4.5 <[email protected]>

eureka928 force-pushed the fix/preserve-table-newlines branch 2 times, most recently from 3fc5a33 to 318330b Compare January 27, 2026 18:21

eureka928 and others added 3 commits January 28, 2026 03:55

fix: preserve newlines in tables while still collapsing multiple spaces

6d92543

chore: bump version to 0.18.33-dev1

4be5dc7

Co-Authored-By: Claude Opus 4.5 <[email protected]>

eureka928 force-pushed the fix/preserve-table-newlines branch from 9fb28db to 4be5dc7 Compare January 28, 2026 03:06
